Overfitting Explained
Abstract
Overfitting arises when model components are evaluated against the wrong reference distribution. Most modeling algorithms iteratively find the best of several components and then test whether this component is good enough to add to the model. We show that for independently distributed random variables, the reference distribution for any one variable underestimates the reference distribution for the highest-valued variable; thus variate values will appear significant when they are not, and model components will be added when they should not be added. We relate this problem to the well-known statistical theory of multiple comparisons or simultaneous inference.

1 Iterative Modeling Algorithms

Iterative modeling algorithms (IMAs) generate a search space $M$ of models by repeatedly selecting a model $m(\cdot) \in M$ and adding a component $c_i$ from a list of components $C = c_1, c_2, \ldots, c_n$ to $m(\cdot)$, producing $m(\cdot; c_i)$. For example, $m(\cdot)$ may be the regression equation $\hat{y} = \beta_3 c_3 + \beta_1 c_1$, and $m(\cdot; c_5)$ is $\hat{y} = \beta_3 c_3 + \beta_1 c_1 + \beta_5 c_5$. Generally, IMAs do not add every possible component to each model $m(\cdot)$ (this would result in exhaustive search); rather, they add the component that appears best according to some evaluation function $x_i = V(c_i; m(\cdot); S)$. We call $x_i$ the score of component $c_i$ given model $m(\cdot)$ and a sample of data $S$. For example, $V$ might compute information gain or classification accuracy for decision tree induction algorithms, F ratios for stepwise multiple regression algorithms, and so on. We may define a general IMA algorithm as follows:

IMA: Initially, $M$ contains the empty model $m()$. Now iterate:
1. Select a model $m(\cdot) \in M$.
2. Remove components from $C$ on logical grounds if necessary, producing $C'$. For example, regression models shouldn't contain multiple occurrences of the same variable, whereas decision trees can in some circumstances.
3. Find the best component, $c_{max} \in C'$, the one with the highest value $x_{max} = \max(x_1, x_2, \ldots, x_n)$, where $x_i = V(c_i; m(\cdot); S)$.
4. If $x_{max} > T_V$, where $T_V$ is a possibly dynamic threshold value, then add $c_{max}$ to $m(\cdot)$.
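The four IMA steps can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the authors' implementation: it follows a single model along one greedy path rather than maintaining the whole search space of models, it uses a fixed rather than dynamic threshold, and the data sample is assumed to be captured inside the scoring function. The names `ima`, `score`, `threshold`, and `prob_max_exceeds` are hypothetical. The second function illustrates the abstract's point for independent scores: if a threshold is exceeded by any single score with probability alpha, the maximum of n such scores exceeds it with probability 1 - (1 - alpha)^n.

```python
from typing import Callable, Hashable, List, Sequence


def ima(components: Sequence[Hashable],
        score: Callable[[Hashable, List[Hashable]], float],
        threshold: float) -> List[Hashable]:
    """Greedy single-path IMA sketch (hypothetical helper, fixed threshold)."""
    model: List[Hashable] = []  # start from the empty model
    while True:
        # Step 2: prune on logical grounds (here: no repeated components).
        candidates = [c for c in components if c not in model]
        if not candidates:
            return model
        # Step 3: score every remaining candidate and pick the best one.
        scores = {c: score(c, model) for c in candidates}
        c_max = max(scores, key=scores.get)
        # Step 4: add the best candidate only if its score beats the threshold.
        if scores[c_max] <= threshold:
            return model
        model.append(c_max)


def prob_max_exceeds(alpha: float, n: int) -> float:
    """P(x_max > t) for n independent scores, each with P(x_i > t) = alpha."""
    return 1.0 - (1.0 - alpha) ** n


# Toy scores that ignore the current model (an assumption made for brevity).
toy = {"c1": 0.9, "c2": 0.4, "c3": 0.7}
print(ima(list(toy), lambda c, m: toy[c], threshold=0.5))  # ['c1', 'c3']
print(round(prob_max_exceeds(0.05, 20), 3))                # 0.642
```

With 20 independent candidate components and a threshold calibrated so that one component exceeds it 5% of the time by chance, the best of the 20 exceeds it roughly 64% of the time, which is why step 4 adds components that should not be added.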
Similar resources
Overfitting Explained: A Case Study
Overfitting is a widely observed pathology of induction algorithms. Overfitted models contain unnecessary structure that reflects nothing more than random variation in the data sample used to construct the model. Such models are less efficient to store and use than their correctly-sized counterparts. Using these models requires the collection of unnecessary data. Portions of overfitted models are wro...
Optimal Weight Decay in a Perceptron
Weight decay was proposed to reduce overfitting as it often appears in the learning tasks of artificial neural networks. In this paper weight decay is applied to a well-defined model system based on a single-layer perceptron, which exhibits strong overfitting. Since the optimal non-overfitting solution is known for this system, we can compare the effect of the weight decay with this solution. A stra...
Bayesian Approaches for Overdispersion in Generalized Linear Models
Generalized linear models (GLMs) have been routinely used in statistical data analysis. The evolution of these models as well as details regarding model fitting, model checking and inference is thoroughly documented in McCullagh and Nelder (1989). However, in many applications, heterogeneity in the observed samples is too large to be explained by the simple variance function which is implicit in...
Slice Gibbs Sampling for Simulation Based Fitting of Spatial Data Models
An auxiliary variable method which we refer to as a slice Gibbs sampler is shown to provide an attractive simulation-based model fitting strategy for fitting Bayesian models under proper priors. Though broadly applicable, we illustrate in the context of fitting spatial models for geo-referenced or point source data. Spatial modeling within a Bayesian framework offers inferential advantages and the sli...
Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occu...
A Dynamic Finite Element Surface
This paper presents a physics-based approach to anatomical surface segmentation, reconstruction, and tracking in multidimensional medical images. The approach makes use of a dynamic "balloon" model, a spherical thin-plate under tension surface spline which deforms elastically to fit the image data. The fitting process is mediated by internal forces stemming from the elastic properties of the spline ...
Publication date: 1997